Systems and methods for incremental natural language understanding

ABSTRACT

A system for incremental natural language understanding includes a media module, a memory storing a software code, and a hardware processor communicatively coupled to the media module. The hardware processor is configured to execute the software code to receive an audio stream including a first utterance, and generate first and second incremental speech recognition outputs based on first and second portions of the first utterance, respectively. In addition, the hardware processor is configured to execute the software code to determine, prior to generating the second incremental speech recognition output, a first intent of the first utterance based on the first incremental speech recognition output. The hardware processor is further configured to execute the software code to retrieve a first resource based on the determined first intent, and incorporate the first resource in media content to be played by the media module.

BACKGROUND

Spoken Language Understanding (SLU) typically comprises an automatic speech recognition (ASR) module followed by a natural language understanding (NLU) module. The two modules process signals in a blocking, sequential fashion, i.e., the NLU often has to wait for the ASR to finish processing an utterance. In a real-time application scenario, the ASR receives a stream of continuous speech signals and outputs corresponding transcriptions. Due to computational complexity and memory constraints, most ASRs typically operate by chunking and processing the speech in segments. This process is often referred to as end-pointing, and is usually determined based on different heuristics related to the duration of inter-pausal units (IPUs), with the goal of minimizing disruption during speech. Finally, the ASR outputs a transcript corresponding to each speech segment. As a result, any NLU application operating on the output of the ASR needs to wait at least until end-pointing, which gives rise to a fundamental bottleneck in latency, and potentially renders the spoken interaction less natural.
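
By way of a non-limiting illustration, the following minimal Python sketch contrasts the blocking behavior described above with an incremental alternative. The word list, function names, and per-call behavior are assumptions made purely for this example, not part of the present disclosure.

    # Toy simulation (assumed names and data, for illustration only): a
    # blocking SLU pipeline invokes the NLU once, after end-pointing,
    # while an incremental pipeline invokes it on every partial result.
    WORDS = ["play", "the", "captain's", "catch", "phrase"]

    def blocking_pipeline(words):
        # NLU sees nothing until the full utterance has been end-pointed.
        return [("nlu-call", " ".join(words))]

    def incremental_pipeline(words):
        # NLU is invoked on each growing partial transcript.
        calls, partial = [], []
        for word in words:
            partial.append(word)
            calls.append(("nlu-call", " ".join(partial)))
        return calls

    print(blocking_pipeline(WORDS))     # 1 NLU call, after end-pointing
    print(incremental_pipeline(WORDS))  # 5 NLU calls, one per partial result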

SUMMARY

There are provided systems and methods for incremental natural language understanding, substantially as shown in and/or described in connection with at least one of the figures, and as set forth more completely in the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a diagram of an exemplary system configured to incorporate resources in media content based on incremental natural language understanding, according to one implementation;

FIG. 2 shows a more detailed diagram of an exemplary system configured to incorporate resources in media content based on incremental natural language understanding, according to one implementation; and

FIG. 3 is a flowchart presenting an exemplary method for use by a system to incorporate resources in media content based on incremental natural language understanding, according to one implementation.

DETAILED DESCRIPTION

The following description contains specific information pertaining to implementations in the present disclosure. One skilled in the art will recognize that the present disclosure may be implemented in a manner different from that specifically discussed herein. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.

FIG. 1 shows a diagram of an exemplary system configured to incorporate resources in media content based on incremental natural language understanding, according to one implementation. As shown in FIG. 1, exemplary system 100 includes server 102 having hardware processor 104 and system memory 106 implemented as a non-transitory storage device storing incremental natural language understanding (NLU) software code 108. In addition, system 100 includes client device 112 having microphone 140, camera 142, and media module 144. Also shown in FIG. 1 are network 110, media content database 116, resource database 118, as well as user 114.

Client device 112 is configured to generate an audio stream including utterances. In one implementation, user 114 may interact with client device 112 by speaking utterances, and microphone 140 may generate an audio stream including the utterances. It is noted that microphone 140 may be implemented using multiple microphones, such as a microphone array, rather than an individual microphone. In another implementation, media module 144 may generate an audio stream including utterances based on an audio track of media content to be played by media module 144. In various implementations, client device 112 may be a smartphone, smartwatch, tablet computer, laptop computer, personal computer, smart TV, home entertainment system, or gaming console, to name a few examples. In one implementation, client device 112 may be a voice modulator in a costume mask. As another example, client device 112 may take the form of a kiosk in a theme park.

Client device 112 may also be configured to generate video data. User 114 may interact with client device 112 by gazes and gestures, and camera 142 may generate video data including the gazes and gestures. It is noted that camera 142 may be implemented using multiple cameras rather than an individual camera. In another implementation, media module 144 may generate video data including gazes or gestures based on a video track of media content to be played by media module 144.

As described further below, media module 144 of client device 112 is configured to play a media content. In one implementation, media content to be played by media module 144 may be received from media content database 116 via network 110. In another implementation, media content to be played by media module 144 may be generated by media module 144 or client device 112, for example, based on audio or video data received from microphone 140 and/or camera 142, respectively. Although client device 112 in FIG. 1 is shown to include microphone 140, camera 142, and media module 144, it is noted that any of microphone 140, camera 142, and media module 144 may be separate from each other, for example, as standalone devices communicatively coupled to each other and/or to network 110.

According to the exemplary implementation shown in FIG. 1, server 102, client device 112, media content database 116, and resource database 118 are communicatively coupled via network 110. Network 110 enables communication of data between server 102, client device 112, media content database 116, and resource database 118. Network 110 may correspond to a packet-switched network such as the Internet, for example. Alternatively, network 110 may correspond to a wide area network (WAN) or a local area network (LAN), or may be included in another type of private or limited distribution network. Network 110 may be a wireless network, a wired network, or a combination thereof. Server 102, client device 112, media content database 116, and resource database 118 may each include a wireless or wired transceiver enabling transmission and reception of data.

Media content database 116 provides media content to be played by media module 144. Media content may include movie content, television programming content, or videos on demand (VODs), for example, including ultra high-definition (ultra HD), HD, or standard-definition (SD) baseband video with embedded audio, captions, timecode, and other ancillary data, such as ratings and/or parental guidelines. In some implementations, media content may include multiple audio tracks, and may utilize secondary audio programming (SAP) and/or Descriptive Video Service (DVS), for example. It is noted that although FIG. 1 depicts media content database 116 as a standalone component, in other implementations, media content database 116 may be included in one or more computing platforms. For example, media content database 116 may be included in a media content distribution platform, or may reside in memory 106 of server 102 or in client device 112.

Hardware processor 104 of server 102 is configured to execute incremental NLU software code 108 to receive an audio stream including utterances from client device 112 via network 110. Hardware processor 104 may also be configured to execute incremental NLU software code 108 to receive video data including gazes and gestures from client device 112 via network 110. As will be described in greater detail below, hardware processor 104 may be further configured to execute incremental NLU software code 108 to generate incremental speech recognition outputs based on portions of the utterances, as well as to generate incremental gaze recognition outputs or incremental gesture recognition outputs based on portions of the video data.

Hardware processor 104 may be the central processing unit (CPU) for server 102, for example, in which role hardware processor 104 runs the operating system for server 102 and executes incremental NLU software code 108. Hardware processor 104 may also be a graphics processing unit (GPU) or an application specific integrated circuit (ASIC). Memory 106 may take the form of any computer-readable non-transitory storage medium. The expression “computer-readable non-transitory storage medium,” as used in the present application, refers to any medium, excluding a carrier wave or other transitory signal, that provides instructions to a hardware processor of a computing platform, such as hardware processor 104 of server 102. Thus, a computer-readable non-transitory medium may correspond to various types of media, such as volatile media and non-volatile media, for example. Volatile media may include dynamic memory, such as dynamic random access memory (dynamic RAM), while non-volatile memory may include optical, magnetic, or electrostatic storage devices. Common forms of computer-readable non-transitory media include, for example, RAM, programmable read-only memory (PROM), erasable PROM (EPROM), and FLASH memory.

It is noted that although FIG. 1 depicts incremental NLU software code 108 as being located in memory 106, that representation is merely provided as an aid to conceptual clarity. More generally, server 102 may include one or more computing platforms, such as computer servers for example, which may be co-located, or may form an interactively linked but distributed system, such as a cloud based system, for instance. As a result, hardware processor 104 and memory 106 may correspond to distributed processor and memory resources within system 100. Thus, it is to be understood that incremental NLU software code 108 may be stored remotely within the distributed memory resources of system 100.

As will be described in greater detail below, hardware processor 104 is configured to execute incremental NLU software code 108 to retrieve resources, such as audio, video, and other resources. Hardware processor 104 is further configured to execute incremental NLU software code 108 to incorporate the resources in media content to be played by media module 144. In one implementation, server 102 may retrieve resources to be incorporated into media content from resource database 118 via network 110. It is noted that although FIG. 1 depicts resource database 118 as a standalone component, in other implementations, resource database 118 may be included in one or more computing platforms. For example, resource database 118 may reside in memory 106 of server 102 or in client device 112.

FIG. 2 shows a more detailed diagram of an exemplary system configured to incorporate resources in media content based on incremental natural language understanding, according to one implementation. According to the present exemplary implementation, system 200 includes incremental NLU software code 208, client device 212, and resource database 218. Incremental NLU software code 208 includes modules for signal processing 220, intent determination 222, entity recognition 228, resource retrieval 234, incorporation instruction 236, and entity audio extraction 238. Signal processing 220 includes modules for automatic speech recognition (ASR) 230, gaze/gesture recognition 232, end of utterance 224, and speaker determination 226. Client device 212 includes microphone 240, camera 242, and media module 244. Media module 244 includes media content 246, resource incorporation 248, display 250, and speaker 252. Resource database 218 includes voice over resources 254, character expressions 264, and ASR models 266. Voice over resources 254 include substitute audio files 256, entity audios 258, voice modulations 260, and catch phrases 262.

System 200, incremental NLU software code 208, client device 212, resource database 218, microphone 240, camera 242, and media module 244 in FIG. 2 correspond respectively in general to system 100, incremental NLU software code 108, client device 112, resource database 118, microphone 140, camera 142, and media module 144 in FIG. 1, and those corresponding features may share any of the characteristics attributed to either corresponding feature by the present disclosure. Thus, although not explicitly shown in FIG. 1, like incremental NLU software code 208 in FIG. 2, incremental NLU software code 108 in FIG. 1 includes features corresponding to signal processing 220, intent determination 222, entity recognition 228, resource retrieval 234, incorporation instruction 236, and entity audio extraction 238.

The functionality of system 100/200 will be further described by reference to FIG. 3 in combination with FIGS. 1 and 2. FIG. 3 shows flowchart 370 presenting an exemplary method for use by a system to incorporate resources in media content based on incremental natural language understanding, according to one implementation. With respect to the method outlined in FIG. 3, it is noted that certain details and features have been left out of flowchart 370 in order not to obscure the discussion of the inventive features in the present application.

Flowchart 370 begins at action 372 with receiving an audio stream including a first utterance. The audio stream may be received by incremental NLU software code 108/208 from client device 112/212 via network 110. The utterance may be any form of speech. For example, the utterance may be a sentence including multiple words. As another example, the utterance may be a dialogue between two or more entities. As yet another example, the utterance may be a song verse including multiple lyrics. It is noted that the audio stream may include multiple utterances. As described above, the audio stream may be generated in various ways. In one implementation, user 114 may interact with client device 112/212 by speaking utterances, and microphone 140/240 may generate an audio stream including the utterances. In another implementation, media module 144/244 may generate an audio stream including utterances based on an audio track of media content 246 to be played by media module 144/244.

Incremental NLU software code 108/208 may also receive video data from client device 112/212 via network 110. As described above, the video data may be generated in various ways. In one implementation, user 114 may interact with client device 112/212 by gazes and gestures, and camera 142/242 may generate video data including the gazes and gestures. Camera 142/242 may include one or more still cameras, such as single shot cameras, and/or one or more video cameras configured to capture multiple video frames in sequence. Camera 142/242 may be a digital camera including a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor. Camera 142/242 may also include an infrared camera. In another implementation, media module 144/244 may generate video data including gazes or gestures based on a video track of media content 246 to be played by media module 144/244.

Flowchart 370 continues at action 374 with generating a first incremental speech recognition output based on a first portion of the first utterance. The first incremental speech recognition output may be generated by ASR 230 processing the audio stream received from client device 112/212. In various implementations, ASR 230 may filter background noise, normalize volume, and break down the audio stream into recognized phonemes. ASR 230 may also perform a statistical probability analysis using the phonemes to deduce a whole word. Where the utterance is a sentence, the first portion of the utterance may be, for example, the first word in the sentence, and the first incremental speech recognition output may be a transcript corresponding to the first word of the sentence. In another implementation, the first incremental speech recognition output may be based on a first group of words in the sentence. More generally, the first incremental speech recognition output may be based on any portion shorter than the entirety of the utterance. As used herein, the terms “first incremental speech recognition output” and “second incremental speech recognition output” are defined relative to each other, and refer to one output preceding generation of the other output. These terms do not require an output to be the first output generated by ASR 230 in an absolute temporal sense. For example, the “first incremental speech recognition output” may technically be the fourth incremental speech recognition output generated by ASR 230, while the “second incremental speech recognition output” may technically be the seventh incremental speech recognition output generated by ASR 230.
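
As a conceptual sketch only, the generation of incremental speech recognition outputs may be pictured as follows. The chunk decoder is stubbed out here; ASR 230 as described above would perform phoneme recognition and statistical probability analysis at that step.

    # Illustrative sketch (not the disclosed implementation): emit an
    # incremental speech recognition output after each received portion
    # of the utterance. decode_chunk is a stand-in for phoneme analysis.
    def incremental_asr(audio_chunks, decode_chunk):
        hypothesis = []
        for chunk in audio_chunks:
            hypothesis.append(decode_chunk(chunk))
            yield " ".join(hypothesis)   # one output per portion

    # Stub decoder for demonstration: each "chunk" already is a word.
    for output in incremental_asr(["let", "it", "go"], decode_chunk=str):
        print(output)   # "let", then "let it", then "let it go"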

Incremental NLU software code 108/208 may also generate a first incremental gaze recognition output or a first incremental gesture recognition output based on a first portion of video data. The first incremental gaze recognition output and/or the first incremental gesture recognition output may be generated by gaze/gesture recognition 232 processing video data from client device 112/212. For example, gaze/gesture recognition 232 may utilize a pupil center corneal reflection method to recognize an eye gaze as the first incremental gaze recognition output. As another example, gaze/gesture recognition 232 may recognize a predetermined hand gesture or a predetermined bodily pose as the first incremental gesture recognition output. Gaze/gesture recognition 232 may also correlate gaze and gesture information. For example, gaze/gesture recognition 232 may improve the accuracy of a recognized eye gaze using a corresponding head position. As used herein, the terms “first incremental gaze/gesture recognition output” and “second incremental gaze/gesture recognition output” are relative terminology, and do not impose an absolute temporal requirement, as described above.

Flowchart 370 continues at action 376 with generating a second incremental speech recognition output based on a second portion of the first utterance. ASR 230 may generate the second incremental speech recognition output in action 376 in a similar manner as the first incremental speech recognition output in action 374, albeit using the second portion of the first utterance. For example, where the utterance is a sentence, the second portion of the utterance may be the second word in the sentence, and the second incremental speech recognition output may be a transcript corresponding to the second word of the sentence. In other examples, the second incremental speech recognition output may correspond to another word of the sentence because, as described above, the terms “first” and “second” are relative terminology, and do not impose an absolute temporal requirement. Advantageously, ASR 230 of incremental NLU software code 108/208 begins generating incremental speech recognition outputs as soon as portions of an utterance are received, rather than waiting until receiving the entirety of an utterance. As described further below, hardware processor 104 can execute incremental NLU software code 108/208 to perform various actions prior to ASR 230 generating a second incremental speech recognition output. As also described below, hardware processor 104 can execute incremental NLU software code 108/208 to update actions based in part on the second incremental speech recognition output. Gaze/gesture recognition 232 may also generate a second incremental gaze recognition output and/or a second incremental gesture recognition output in a similar manner as the first incremental gaze recognition output and/or the first incremental gesture recognition output, albeit using a later portion of video data from client device 112/212.

Flowchart 370 continues at action 378 with determining, prior to generating the second incremental speech recognition output, a first intent of the first utterance based on the first incremental speech recognition output. Intent determination 222 may determine the first intent of the first utterance. In one implementation, intent determination 222 determines that the first intent of the first utterance includes impersonating a character. For example, the first incremental speech recognition output may be a word in a catch phrase associated with a particular movie character or cartoon character. Based on the recognized word, intent determination 222 may determine that the first intent of the first utterance may be to impersonate that particular movie character or cartoon character, or to complete the catch phrase. Intent determination 222 may utilize a probabilistic model to determine the first intent. Intent determination 222 may be trained on several possible intents, and may generate confidence scores based on the first incremental speech recognition output for each of the several possible intents. Intent determination 222 may determine the first intent of the first utterance to be, for example, the intent corresponding to the highest confidence score. In one implementation, intent determination 222 may determine multiple intents of the first utterance based on the first incremental speech recognition output. For example, intent determination 222 may determine that the first utterance has two intents corresponding to the two highest confidence scores. As another example, intent determination 222 may determine that a first intent of the first utterance corresponds to the highest confidence score, and that subsequent intents of the first utterance correspond to confidence scores that exceed a threshold. As used herein, the term “first intent” refers to an intent corresponding to the first incremental speech recognition output. This term does not require an intent to be the first intent determined by intent determination 222 in an absolute temporal sense. For example, the “first intent” may technically be the fourth intent determined by intent determination 222.
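
For conceptual clarity, the confidence-score selection logic described above may be sketched as follows. The intent names, scores, and threshold are illustrative assumptions; in practice the scores would be produced by the trained probabilistic model of intent determination 222.

    # Hedged sketch: pick the top-scoring intent, or all intents whose
    # confidence exceeds a threshold. Scores below are placeholders for
    # the output of a trained probabilistic intent model.
    THRESHOLD = 0.30

    def determine_intents(scores, multi=False):
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if not multi:
            return [ranked[0][0]]          # highest confidence only
        top, rest = ranked[0], ranked[1:]
        return [top[0]] + [name for name, s in rest if s >= THRESHOLD]

    scores = {"impersonate_character": 0.62,
              "complete_catch_phrase": 0.35,
              "sing_song": 0.02}
    print(determine_intents(scores))              # top intent only
    print(determine_intents(scores, multi=True))  # both intents above 0.30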

In one implementation, intent determination 222 determines that the first intent of the first utterance includes a predetermined word. For example, the first incremental speech recognition output may be a word associated with or commonly preceding a particular predetermined word. Based on the recognized word, intent determination 222 may determine that the first intent of the first utterance includes that particular predetermined word. A predetermined word may be any word for which intent determination 222 is trained. In one implementation, a predetermined word can be a prohibited word, such as a curse word. Upon determining that the first intent of the first utterance includes a prohibited word, intent determination 222 may estimate a starting point of the prohibited word in the first utterance, and may also estimate a duration of the prohibited word in the first utterance. In this implementation, intent determination 222 estimates the starting point and duration of the prohibited word prior to the prohibited word occurring in the first utterance, based on the first incremental speech recognition output. Estimating the starting point or the duration of the prohibited word in the first utterance can include determining intervening words between the first portion of the first utterance and the prohibited word, and/or determining a rate of speech based on the first incremental speech recognition output. Although the example described above refers to a predetermined word, it is noted that intent determination 222 may determine that the first intent of the first utterance includes a predetermined phrase.
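
One possible formulation of this timing estimate, offered only as an assumption-laden sketch, projects the starting point from the rate of speech observed so far and the expected number of intervening words:

    # Illustrative estimate (all parameter values are assumptions): given
    # N words heard in elapsed_s seconds, project when a predicted word
    # will start and how long it will last.
    def estimate_word_timing(elapsed_s, words_heard, intervening_words,
                             syllables_in_word, syllables_per_word=1.5):
        rate_wps = words_heard / elapsed_s                # words per second
        start_s = elapsed_s + intervening_words / rate_wps
        duration_s = (syllables_in_word / syllables_per_word) / rate_wps
        return start_s, duration_s

    # 3 words heard in 1.2 s; 2 intervening words expected before a
    # 2-syllable prohibited word.
    start, duration = estimate_word_timing(1.2, 3, 2, 2)
    print(f"start ~{start:.2f}s, duration ~{duration:.2f}s")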

In one implementation, intent determination 222 determines that the first intent of the first utterance includes singing a song. For example, the first incremental speech recognition output may be a lyric of a song. In one implementation, intent determination 222 determines that the first intent of the first utterance includes a personal introduction. For example, the first incremental speech recognition output may be a word associated with or commonly preceding user 114 introducing their name. In various implementations, intent determination 222 determines that the first intent of the first utterance includes enabling subtitles, foreign language audio tracks, or other assistive features for media content 146/246 played by media module 144/244. In one implementation, intent determination 222 determines that the first intent of the first utterance includes enabling a foreign language mode on client device 112/212.

Prior to generating the second incremental gaze recognition output or the second incremental gesture recognition output, intent determination 222 may determine the first intent of the first utterance further based on the first incremental gaze recognition output or the first incremental gesture recognition output. For example, intent determination 222 may determine the first intent from several possible intents based on whether an eye gaze of user 114 was directed toward camera 142/242 of client device 112/212 while speaking the first portion of the first utterance. As another example, intent determination 222 may determine the first intent from several possible intents based on whether user 114 made a predetermined hand gesture or a predetermined bodily pose while speaking the first portion of the first utterance.

Intent determination 222 may determine the first intent of the first utterance further based on a determined first speaker corresponding to the first incremental speech recognition output. Speaker determination 226 of signal processing 220 may determine the first speaker corresponding to the first incremental speech recognition output. For example, speaker determination 226 may determine that user 114 corresponds to the first incremental speech recognition output by correlating unique voice patterns of the first incremental speech recognition output with unique voice patterns of user 114 stored in memory 106. Alternatively, where user 114 is a new user, speaker determination 226 may create a new profile for user 114 in memory 106 that includes the unique voice patterns of the first incremental speech recognition output. Speaker determination 226 may also determine a first speaker using facial recognition of gaze/gesture recognition 232. For example, by correlating timestamps for a facial recognition output from gaze/gesture recognition 232 with timestamps for the first incremental speech recognition output, speaker determination 226 may determine that user 114 corresponds to the first incremental speech recognition output. Intent determination 222 may then utilize outputs of speaker determination 226. For example, intent determination 222 may determine the first intent from several possible intents based on whether user 114 or another speaker corresponds to the first incremental speech recognition output.
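
A simplified sketch of such voice-pattern correlation follows. The three-element vectors and the similarity threshold are placeholders; a production implementation of speaker determination 226 would compare learned speaker representations.

    # Hedged sketch: match an incoming voice-pattern vector against stored
    # per-user vectors by cosine similarity; enroll a new profile when no
    # stored pattern matches well enough. Vectors here are toy values.
    import math

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def cosine(a, b):
        return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

    def determine_speaker(pattern, profiles, threshold=0.85):
        if profiles:
            best = max(profiles, key=lambda n: cosine(pattern, profiles[n]))
            if cosine(pattern, profiles[best]) >= threshold:
                return best
        new_name = f"user_{len(profiles) + 1}"   # enroll a new profile
        profiles[new_name] = pattern
        return new_name

    profiles = {"user_114": [0.9, 0.1, 0.3]}
    print(determine_speaker([0.88, 0.12, 0.31], profiles))  # user_114
    print(determine_speaker([0.1, 0.9, 0.2], profiles))     # user_2 (new)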

Entity recognition 228 may recognize an entity in the audio stream for preservation of audio data, such as pronunciation and intonation. Entity recognition 228 may recognize the entity from a list of predetermined entities stored in memory 106. Alternatively, entity recognition 228 may utilize outputs of intent determination 222 and/or ASR 230 to recognize the entity. For example, the first incremental speech recognition output may include user 114 introducing their name. Where intent determination 222 determines that the first intent of the first utterance includes a personal introduction, and ASR 230 outputs a transcript that does not include a dictionary word or assigns the transcript a high error value, entity recognition 228 may recognize that the transcript includes a name. Upon determining that the first incremental speech recognition output includes a name, entity recognition 228 may instruct entity audio extraction 238 to extract an audio portion corresponding to the recognized name from the first portion of the first utterance. For example, the first portion of the first utterance may be temporarily stored by incremental NLU software code 108/208, and entity audio extraction 238 may extract an audio portion including phonemes or other audio data corresponding to the recognized name from the first portion of the first utterance for permanent storage. Entity audio extraction 238 may then instruct incremental NLU software code 108/208 to store the extracted audio portion, for example, in resource database 118/218. Although the example described above refers to recognizing a name, an entity recognized by entity recognition 228 and extracted by entity audio extraction 238 can be any word or phrase for preservation of audio data, such as a title, proper noun, interjection, or word commonly spoken with two different pronunciations.

Flowchart 370 continues at action 380 with retrieving a first resource based on the determined first intent. The first resource may be retrieved from resource database 118/218 by resource retrieval 234 via network 110. Resource retrieval 234 may formulate a resource retrieval request, and may then instruct incremental NLU software code 108/208 to transmit the resource retrieval request to resource database 118/218, for example, using a wireless transmitter. Resource database 118/218 may identify the first resource among resources stored therein based on a retrieval request, and may transmit the first resource to incremental NLU software code 108/208. As used herein, the term “first resource” refers to a resource corresponding to the first intent. This term does not require a resource to be the first resource retrieved by resource retrieval 234 in an absolute temporal sense. For example, the “first resource” may technically be the fourth resource retrieved by resource retrieval 234.

Where the determined first intent of the first utterance includes a prohibited word, the first resource retrieved by resource retrieval 234 may be one of substitute audio files 256. Substitute audio files 256 may include high pitch bleeps such as those commonly used to censor prohibited words on television or radio broadcasts. In other implementations, substitute audio files 256 may include any type of audio signal. In one implementation, a duration of the retrieved substitute audio file is approximately the duration of the prohibited word, as estimated by intent determination 222. For example, resource retrieval 234 may include the estimated duration of the prohibited word in a resource retrieval request transmitted to resource database 118/218, and resource database 118/218 may be configured to determine one of substitute audio files 256 having a duration that most closely matches the estimated duration of the prohibited word.
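
The duration-matching selection described above may be sketched, under assumed file names and durations, as a nearest-duration lookup:

    # Hedged sketch: the resource database picks the substitute audio file
    # whose duration most closely matches the estimated duration of the
    # prohibited word. Names and durations below are illustrative only.
    SUBSTITUTE_AUDIO_FILES = {"bleep_short.wav": 0.25,
                              "bleep_medium.wav": 0.50,
                              "bleep_long.wav": 1.00}   # seconds

    def closest_substitute(estimated_duration_s):
        return min(SUBSTITUTE_AUDIO_FILES,
                   key=lambda f: abs(SUBSTITUTE_AUDIO_FILES[f]
                                     - estimated_duration_s))

    print(closest_substitute(0.42))   # bleep_medium.wav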

Where the determined first intent of the first utterance includes pronunciation of a user's name, such as in a personalized message, the first resource retrieved by resource retrieval 234 may be one of entity audios 258 corresponding to that particular user. For example, the retrieved entity audio may be an audio portion including the correct pronunciation of the name of user 114, as extracted by entity audio extraction 238. Where the determined first intent of the first utterance is to impersonate a particular movie character or cartoon character, the first resource retrieved by resource retrieval 234 may be one of voice modulations 260 corresponding to that particular movie or cartoon character. Voice modulations 260 can include instructions for pitch shifting, distorting, and filtering audio data. For example, voice modulations 260 can include instructions for transforming audio data to be similar to unique voice patterns of the particular movie or cartoon character. Where the determined first intent of the first utterance is to complete a catch phrase of a particular movie character or cartoon character, the first resource retrieved by resource retrieval 234 may be one of catch phrases 262 corresponding to that particular movie character or cartoon character.

Voice over resources 254 can include other resources not depicted in FIG. 2. For example, where the determined first intent of the first utterance is to sing a song, the first resource retrieved by resource retrieval 234 may be the particular song, or an instrumental accompaniment omitting the audio lyrics of the particular song. It is noted that although substitute audio files 256, voice modulations 260, entity audios 258, and catch phrases 262 are depicted as voice over resources 254 in the present implementation, in various implementations, resources in resource database 218 may take any form including text, video, or any resource capable of being incorporated in media content 246 to be played by media module 144/244, as described further below. For example, where the determined first intent of the first utterance includes enabling a foreign language mode on client device 112/212, the first resource retrieved by resource retrieval 234 may be the subtitles or foreign language audio tracks. As another example, where the determined first intent of the first utterance includes interacting with a virtual character, such as a virtual character displayed on display 250, the first resource retrieved by resource retrieval 234 may be one of character expressions 264. Character expressions 264 may be videos or 2D or 3D model animations of a virtual character performing various actions, such as frowning or jumping. In one implementation, resource retrieval 234 may retrieve multiple resources based on the determined first intent.

Incremental NLU software code 108/208 may also retrieve resources based on the determined first intent where the resources are not specifically for incorporation in media content 246 to be played by media module 144/244. For example, where the determined first intent of the first utterance includes enabling a foreign language mode on client device 112/212, the first resource retrieved by resource retrieval 234 may be one of ASR models 266. ASR 230 may employ the retrieved ASR model to process the audio stream from client device 112/212 in a different manner. For example, different ASR models 266 can be employed to generate incremental speech recognition outputs for a language, dialect, or accent corresponding to the particular foreign language mode. As another example, in order to avoid biasing generated incremental speech recognition outputs towards common or expected words, such as words recognized in an official or predetermined dictionary, where the determined first intent of the first utterance includes a personal introduction, ASR 230 may employ the retrieved ASR model to negate or mitigate biasing of subsequently generated incremental speech recognition outputs towards a common or expected word, such that entity recognition 228 can more easily recognize an entity or name from the subsequently generated incremental speech recognition outputs. Incremental NLU software code 108/208 may retrieve one of ASR models 266 in addition to resources for incorporation in media content 246 to be played by media module 144/244, such as foreign language subtitles or audio tracks, catch phrases, song files, etc. It is noted that although FIG. 2 depicts resources as residing in resource database 218, in some implementations, resources may reside in memory 106 of server 102.
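
A minimal sketch of such model selection, assuming hypothetical intent and model identifiers, follows:

    # Hedged sketch: swap ASR models only when the determined intent
    # enables a foreign language mode. Model names are placeholders,
    # not identifiers from the present disclosure.
    ASR_MODELS = {"en-US": "asr_en_us", "es-ES": "asr_es_es",
                  "fr-FR": "asr_fr_fr"}

    def select_asr_model(intent, current="asr_en_us"):
        if intent.get("name") == "enable_foreign_language_mode":
            return ASR_MODELS.get(intent.get("language"), current)
        return current

    print(select_asr_model({"name": "enable_foreign_language_mode",
                            "language": "fr-FR"}))   # asr_fr_fr
    print(select_asr_model({"name": "sing_song"}))   # unchanged default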

Flowchart 370 continues at action 382 with incorporating the first resource in a media content to be played by a media module. The first resource may be incorporated in media content 246 to be played on media module 144/244. Media module 144/244 may be an application running on client device 112/212. Media module 144/244 includes media content 246, resource incorporation 248, display 250, and speaker 252. Incorporation instruction 236 of incremental NLU software code 108/208 may provide instructions to resource incorporation 248 of media module 144/244 regarding how to incorporate the first resource in media content 246. Incorporation instruction 236 may also provide the first resource to media module 144/244. In one implementation, incorporation instruction 236 provides the instructions as metadata together with the first resource. Resource incorporation 248 of media module 144/244 parses the instructions and incorporates the first resource in media content 246. As described further below, incorporation of the first resource in media content 246 can entail replacing a portion of media content 246 with the first resource, playing the first resource along with media content 246, or other means of incorporation.
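
One way that incorporation instruction 236 might package such metadata together with the first resource is sketched below; the field names and JSON encoding are assumptions for illustration, not a disclosed format.

    # Hedged sketch of an instruction payload (assumed schema) that
    # resource incorporation 248 could parse to apply the first resource.
    import json

    instruction = {
        "resource_id": "substitute_audio/bleep_medium.wav",
        "method": "replace_audio",   # vs. "overlay", "display_text", ...
        "start_s": 3.84,             # estimated starting point in content
        "duration_s": 0.50,
    }

    payload = json.dumps(instruction)   # sent alongside the resource

    def incorporate(media_content, payload):
        meta = json.loads(payload)      # parsing step on the media module
        print(f"{meta['method']} in {media_content} at {meta['start_s']}s "
              f"for {meta['duration_s']}s using {meta['resource_id']}")

    incorporate("media_content_246", payload)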

As described above, media content 246 may be movie content, television programming content, or VODs, for example, including ultra high-definition (ultra HD), HD, or standard-definition (SD) baseband video with embedded audio, captions, timecode, and other ancillary data, such as ratings and/or parental guidelines. In some implementations, media content 246 may include multiple audio tracks, and may utilize secondary audio programming (SAP) and/or Descriptive Video Service (DVS), for example. Media content 246 may be a live broadcast video. Media content 246 may also include 2D or 3D model animations. It is also noted that although FIG. 2 depicts media content 246 as residing on media module 144/244, in some implementations, media content 246 may reside in media content database 116 (shown in FIG. 1) or in memory 106. In such implementations, media content 246 may be transmitted to media module 144/244 from media content database 116 or from memory 106 of server 102 (shown in FIG. 1). In another implementation, media content may be generated by media module 144/244 itself.

Media module 244 can play media content 246 and the first resource using display 250 and/or speaker 252. Display 250 may be implemented as a liquid crystal display (LCD), a light-emitting diode (LED) display, an organic light-emitting diode (OLED) display, or any other suitable display screen that performs a physical transformation of signals to light. Speaker 252 may be implemented as a micro-electrical mechanical systems (MEMS) speaker, a speaker array, or any other suitable speaker that performs a physical transformation of signals to sound. It is noted that although FIG. 2 depicts media module 244 as including a single display 250 and a single speaker 252, that representation is also merely provided as an aid to conceptual clarity. More generally, media module 244 may include one or more displays and/or speakers, which may be co-located, or interactively linked but distributed.

In one implementation, media content 246 is a live broadcast being played by media module 144/244 concurrently as it is generated by media module 144/244, for example, based on audio or video data received from microphone 140/240 and/or camera 142/242, respectively. Client device 112/212 may transmit an audio stream including the first utterance of the live broadcast to server 102. Where incremental NLU software code 108/208 determines that the first intent of the first utterance includes a prohibited word, one of substitute audio files 256 retrieved from resource database 118/218 may be substituted for the prohibited word. For example, as described above, intent determination 222 may estimate a starting point of the prohibited word in the live broadcast. Then incorporation instruction 236 and resource incorporation 248 can substitute an audio portion of the live broadcast including the prohibited word with the substitute audio file at the estimated starting point. Thus, system 100/200 may censor a live broadcast without introducing a time delay and without requiring prior knowledge of the contents of the live broadcast.
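
A toy, runnable model of this substitution follows; the sample rate and sample values are deliberately unrealistic stand-ins chosen so the example is self-contained.

    # Hedged sketch: replace the span of an audio buffer corresponding to
    # the estimated start and duration of the prohibited word with bleep
    # samples. The 10 Hz "sample rate" is a toy value for illustration.
    SAMPLE_RATE = 10  # samples per second

    def censor(buffer, start_s, duration_s, bleep_value=9):
        start = int(start_s * SAMPLE_RATE)
        end = start + int(duration_s * SAMPLE_RATE)
        return buffer[:start] + [bleep_value] * (end - start) + buffer[end:]

    live_audio = list(range(20))                 # 2 s of "audio"
    print(censor(live_audio, start_s=0.8, duration_s=0.5))
    # samples 8-12 replaced by the substitute value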

In one implementation, client device 112/212 is a voice modulator and media content 246 is an audio stream being played by speaker 252 of media module 144/244 concurrently as it is generated by media module 144/244, for example, based on audio data received from microphone 140/240. Client device 112/212 may transmit the audio stream including a first utterance to server 102. Where incremental NLU software code 108/208 determines that the first intent of the first utterance is to impersonate a particular movie character or cartoon character, incorporation instruction 236 and resource incorporation 248 may apply one of voice modulations 260 corresponding to the particular movie or cartoon character retrieved from resource database 118/218 to the audio stream being played by speaker 252 of media module 144/244, such that the audio stream is transformed with unique voice patterns of the particular movie or cartoon character. Thus, system 100/200 may achieve voice modulation of subsequent portions of an audio stream after only receiving a first portion thereof. In one implementation, media module 144/244 may introduce a delay between microphone 140/240 generating the audio stream and speaker 252 of media module 144/244 playing the audio stream, in order to retroactively apply one of voice modulations 260 to the first portion of the audio stream, which incremental NLU software code 108/208 utilized to determine the first intent and retrieve the corresponding one of voice modulations 260.

In one implementation, media content 246 played by media module 144/244 is a message personalized for user 114. Where incremental NLU software code 108/208 determines that the first intent of the first utterance includes utilizing the name of user 114, incorporation instruction 236 and resource incorporation 248 may incorporate one of entity audios 258 corresponding to user 114 retrieved from resource database 118/218 into the personalized message. In one implementation, an audio portion including the pronunciation of the recognized name of user 114, as extracted by entity audio extraction 238, is incorporated into the personalized message. In another implementation, text including the spelling of the name of user 114 is incorporated into the personalized message, for example, by display 250.

Media content 246 played by media module 144/244 may also be personalized by incorporating an audio portion corresponding to a recognized entity other than a name. For example, media content 246 played by media module 144/244 may be an audio encyclopedia or an interactive search engine. Where incremental NLU software code 108/208 determines that the first intent of the first utterance includes learning about or discussing pecans, incorporation instruction 236 and resource incorporation 248 may incorporate one of entity audios 258 retrieved from resource database 118/218, such that an audio portion including the user's specific pronunciation of the word “pecan” (e.g., “pee-can” versus “puh-can”), as extracted by entity audio extraction 238, is incorporated into an encyclopedia entry or search result. By the same token, any media content 246 playing audio for the word “Caribbean” may incorporate one of entity audios 258 having an accent on the first syllable (e.g., “KAR-i-bee-in”) or one of entity audios 258 having an accent on the second syllable (e.g., “ka-RIB-ee-in”) depending on the audio portion extracted by entity audio extraction 238.

In one implementation, media content 246 played by media module 144/244 is an interactive game that prompts user 114 to speak a catch phrase. Where the determined first intent of the first utterance is a catch phrase of a particular movie character or cartoon character, incorporation instruction 236 and resource incorporation 248 may incorporate one of catch phrases 262 corresponding to that particular movie character or cartoon character retrieved from resource database 118/218 into the interactive game. For example, media module 144/244 may display the retrieved catch phrase on display 250 as user 114 begins to speak the first word of the catch phrase. As another example, media module 144/244 may play the catch phrase using speaker 252.

In a similar implementation, media content 246 played by media module 144/244 is an interactive foreign language education program that prompts user 114 to speak a sentence in a foreign language. Where user 114 correctly pronounces the first word of the sentence in the foreign language, incremental NLU software code 108/208 may determine the first intent of the first utterance is to complete the sentence, and may incorporate appropriate resources in the foreign language education program. For example, media module 244 may display one or more remaining words of the foreign language sentence on display 250 as user 114 speaks the first word. As another example, media module 144/244 may play one or more remaining words of the foreign language sentence using speaker 252 so that user 114 may speak along.

In one implementation, media content 246 played by media module 144/244 is a karaoke application. Where user 114 sings the first lyric(s) of a song, incremental NLU software code 108/208 may determine that the first intent of the first utterance is to sing a particular song. Incorporation instruction 236 and resource incorporation 248 may incorporate appropriate resources corresponding to that particular song, which are retrieved from resource database 118/218, into the karaoke application. For example, media module 244 may play the particular song, or an instrumental accompaniment omitting the audio lyrics of the particular song, using speaker 252 as user 114 begins to sing the first lyric(s). As another example, media module 144/244 may display the remaining lyrics of the particular song on display 250.

In one implementation, media content 246 played by media module 144/244 is a movie. User 114 may begin speaking an utterance to microphone 140/240 commanding the client device 112/212 to enable a foreign language mode. When incremental NLU software code 108/208 determines that the first intent of the first utterance includes enabling the foreign language mode, incorporation instruction 236 and resource incorporation 248 may incorporate subtitles or foreign language audio tracks corresponding to the foreign language mode retrieved from resource database 118/218 into the movie via display 250 or speaker 252. Thus, system 100/200 achieves predictive voice-activated control for media module 144/244.

In one implementation, media content 246 played by media module 144/244 is an interactive story displaying virtual characters. Where the determined first intent of the first utterance includes describing an event that happens to the virtual characters, incorporation instruction 236 and resource incorporation 248 may incorporate one of character expressions 264 corresponding to the event retrieved from resource database 118/218 into the virtual character. For example, media module 144/244 may display 2D or 3D model animations of a virtual character performing a new action, such as frowning or jumping, on display 250. As another example, media module 144/244 may display a new virtual character corresponding to the event on display 250.

Advantageously, incremental NLU software code 108/208 begins processing as soon as the first portion of an utterance is recognized, rather than waiting until recognizing the entirety of an utterance. Thus, incremental NLU software code 108/208 may determine intents, retrieve resources, and incorporate resources in media content 246 with reduced latency.

As described above with respect to action 376, incremental NLU software code 108/208 may continue incremental processing by generating the second incremental speech recognition output based on the second portion of the first utterance in a similar manner as the first incremental speech recognition output. After generating the second incremental speech recognition output, intent determination 222 may update the first intent based on both the first and second incremental speech recognition outputs. For example, where user 114 begins to sing a first lyric that is shared by two songs and intent determination 222 determined that the first intent included singing the first song, after user 114 sings a second lyric that is specific to the second song, intent determination 222 may update the first intent to include singing the second song instead. In this example, resource retrieval 234 may retrieve a second resource from resource database 118/218 based on the updated first intent, and incorporation instruction 236 and resource incorporation 248 may incorporate the second resource in media content 246 being played by media module 144/244. In a similar manner, intent determination 222 may update the first intent of the first utterance using a second incremental gaze recognition output or a second incremental gesture recognition output.
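
The update described in this example may be sketched as re-scoring intents on the combined partial transcript. The keyword-overlap scoring below is a deliberately simple stand-in for the probabilistic model of intent determination 222, and the lyric fragments are invented:

    # Hedged sketch: re-score candidate songs each time a new partial
    # transcript arrives; an intent shared by two songs resolves once a
    # distinguishing lyric is heard.
    SONG_LYRICS = {"song_1": "let it snow let it snow",
                   "song_2": "let it go let it go"}

    def score(partial, lyrics):
        words = partial.split()
        return sum(w in lyrics.split() for w in words) / len(words)

    def best_intent(partial):
        return max(SONG_LYRICS, key=lambda s: score(partial, SONG_LYRICS[s]))

    print(best_intent("let it"))     # tie on the shared lyric (first song)
    print(best_intent("let it go"))  # updated: now specific to song_2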

However, where user 114 begins to sing a first lyric that is shared by two songs and intent determination 222 determined that the first intent included singing the first song, after user 114 sings a second lyric that is specific to the first song, intent determination 222 may update the first intent by assigning a higher confidence score to the first song. In this example, resource retrieval 234 need not retrieve a second resource from resource database 118/218. In various implementations, system 100/200 can request and receive confirmation from user 114 regarding the accuracy of the first intent determined based on the first incremental speech recognition output. For example, microphone 140/240, camera 142/242, or a touchscreen on display 250 can receive an input indicating confirmation from user 114.

Speaker determination 226 of signal processing 220 may determine a speaker corresponding to a second incremental speech recognition output in a similar manner as described above. For example, speaker determination 226 may determine that user 114 corresponds to the second incremental speech recognition output. In another example, speaker determination 226 may determine that a second speaker (different from user 114 corresponding to the first incremental speech recognition output) corresponds to the second incremental speech recognition output of the first utterance. In other words, system 100/200 can incrementally determine whether a speaker change has occurred in an utterance. Intent determination 222 may determine a second intent for the first utterance based on the determined second speaker. Intent determination 222 may track both the first and second intents, for example, where the utterance is a dialogue and each speaker has corresponding intents. Intent determination 222 may update the second intent in a similar manner as the first intent. Resource retrieval 234 may retrieve a second resource from resource database 118/218 based on the second intent, and incorporation instruction 236 and resource incorporation 248 may incorporate the second resource in media content 246 being played by media module 144/244 instead of or in addition to the first resource.

End of utterance 224 of signal processing 220 may determine that a portion of an utterance is the end of the utterance. For example, by analyzing outputs from ASR 230 and/or gaze/gesture recognition 232, end of utterance 224 may determine that the second portion of the first utterance is the end of the first utterance. Intent determination 222 may utilize the output from end of utterance 224 when determining or updating an intent. For example, where intent determination 222 updated the first intent based on the first and second incremental speech recognition outputs, after end of utterance 224 determines that the second portion of the first utterance is the end of the first utterance, intent determination 222 may update the updated first intent based on the determined end of the first utterance.

Thus, the present application discloses various implementations of systems for incremental natural language understanding, as well as methods for use by such systems. From the above description it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described herein, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.

What is claimed is:
1. A system comprising: a media module configured to play a media content; a memory storing a software code; a hardware processor communicatively coupled to the media module, and configured to execute the software code to: receive an audio stream including a first utterance; generate a first incremental speech recognition output based on a first portion of the first utterance; generate a second incremental speech recognition output based on a second portion of the first utterance; determine, prior to generating the second incremental speech recognition output, a first intent of the first utterance based on the first incremental speech recognition output; retrieve a first resource based on the determined first intent; and incorporate the first resource in the media content to be played by the media module.
2. The system of claim 1, wherein: the determined first intent of the first utterance comprises a predetermined word; the first resource comprises a substitute audio file; and the incorporating the first resource in the media content comprises substituting an audio portion of the media content including the predetermined word with the substitute audio file.

3. The system of claim 2, wherein the retrieving the first resource based on the determined first intent comprises: estimating a duration of the predetermined word in the first utterance; and determining that a duration of the substitute audio file is approximately the duration of the predetermined word.
4. The system of claim 2, wherein: the hardware processor is further configured to estimate a starting point of the predetermined word in the first utterance, and substitute the audio portion of the media content including the predetermined word with the substitute audio file at the estimated starting point.
5. The system of claim 1, wherein the hardware processor is further configured to: recognize an entity in the audio stream; extract an audio portion corresponding to the recognized entity from the audio stream; and store the audio portion in a resource database; wherein the incorporating the first resource in the media content includes incorporating the audio portion corresponding to the recognized entity in the media content.

6. The system of claim 1, wherein the hardware processor is further configured to: determine a first speaker corresponding to the first incremental speech recognition output, wherein the determining the first intent of the first utterance is further based on the determined first speaker; determine a second speaker corresponding to the second incremental speech recognition output; and determine a second intent of the first utterance based on the determined second speaker.
7. The system of claim 1, wherein the hardware processor is further configured to update, after generating the second incremental speech recognition output, the first intent of the first utterance based on the first incremental speech recognition output and the second incremental speech recognition output.
8. The system of claim 7, wherein the hardware processor is further configured to: determine that the second portion of the first utterance comprises an end of the first utterance; and update the updated first intent based on the end of the first utterance.
9. The system of claim 7, wherein the hardware processor is further configured to: retrieve a second resource based on the updated first intent; and incorporate the second resource in the media content to be played by the media module.
10. The system of claim 1, wherein the hardware processor is further configured to: receive video data; generate an incremental gaze recognition output or an incremental gesture recognition output based on a portion of the video data; and determine the first intent of the first utterance further based on the incremental gaze recognition output or the incremental gesture recognition output.
11. A method for use by a system including a media module configured to play a media content, a memory storing a software code, and a hardware processor communicatively coupled to the media module, the method comprising: receiving, using the hardware processor, an audio stream including a first utterance; generating, using the hardware processor, a first incremental speech recognition output based on a first portion of the first utterance; generating, using the hardware processor, a second incremental speech recognition output based on a second portion of the first utterance; determining, prior to generating the second incremental speech recognition output and using the hardware processor, a first intent of the first utterance based on the first incremental speech recognition output; retrieving, using the hardware processor, a first resource based on the determined first intent; and incorporating, using the hardware processor, the first resource in the media content to be played by the media module.
12. The method of claim 11, wherein: the determined first intent of the first utterance comprises speaking a predetermined word; the first resource comprises a substitute audio file; and the incorporating the first resource in the media content comprises substituting an audio portion of the media content including the predetermined word with the substitute audio file.
13. The method of claim 12, wherein the retrieving the first resource based on the determined first intent comprises: estimating, using the hardware processor, a duration of the predetermined word in the first utterance; and determining, using the hardware processor, that a duration of the substitute audio file is approximately the duration of the predetermined word.
14. The method of claim 12, further comprising: estimating, using the hardware processor, a starting point of the predetermined word in the first utterance; and substituting, using the hardware processor, the audio portion of the media content including the predetermined word with the substitute audio file at the estimated starting point.
15. The method of claim 11, further comprising: recognizing, using the hardware processor, an entity in the audio stream; extracting, using the hardware processor, an audio portion corresponding to the recognized entity from the audio stream; and storing, using the hardware processor, the audio portion in a resource database; wherein the incorporating the first resource in the media content includes incorporating the audio portion corresponding to the recognized entity in the media content.
16. The method of claim 11, further comprising: determining, using the hardware processor, a first speaker corresponding to the first incremental speech recognition output, wherein the determining the first intent of the first utterance is further based on the determined first speaker; determining, using the hardware processor, a second speaker corresponding to the second incremental speech recognition output; and determining, using the hardware processor, a second intent of the first utterance based on the determined second speaker.
17. The method of claim 11, further comprising updating, after generating the second incremental speech recognition output and using the hardware processor, the first intent of the first utterance based on the first incremental speech recognition output and the second incremental speech recognition output.
18. The method of claim 17, further comprising: determining, using the hardware processor, that the second portion of the first utterance comprises an end of the first utterance; and updating, using the hardware processor, the updated first intent based on the end of the first utterance.
19. The method of claim 17, further comprising: retrieving, using the hardware processor, a second resource based on the updated first intent; and incorporating, using the hardware processor, the second resource in the media content to be played by the media module.
20. The method of claim 11, further comprising: receiving, using the hardware processor, video data; generating, using the hardware processor, an incremental gaze recognition output or an incremental gesture recognition output based on a portion of the video data; and determining, using the hardware processor, the first intent of the first utterance further based on the incremental gaze recognition output or the incremental gesture recognition output.